Gemma4 HeteroTP support for NIXL KV connector by ZhanqiuHu · Pull Request #41169 · vllm-project/vllm

ZhanqiuHu · 2026-04-28T20:55:57Z

Depends on #40731

mergify · 2026-04-28T20:56:39Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ZhanqiuHu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request refactors the NIXL connector to use a plan-based transfer architecture, introducing EngineTransferPlan to handle complex transfer geometries for hybrid SSM+Attention and HeteroTP models. While this architecture simplifies the transfer hot path, several critical bugs were identified: the splitting logic incorrectly assumes uniform parameters across all attention groups, and an incorrect boundary parameter breaks Mamba state transfers. Additionally, the block trimming logic must be updated to avoid corrupting Mamba state, and plan generation needs to be more robust when encountering homogeneous remote engines to prevent assertion failures.

gemini-code-assist · 2026-04-28T20:59:35Z

+            remote_tpb = tuple(nixl_agent_meta.tokens_per_block_per_group or ())
+            assert len(remote_tpb) == len(self._group_kinds), (
+                f"Remote tokens_per_block_per_group length "
+                f"{len(remote_tpb)} != {len(self._group_kinds)} groups"
+            )


The assertion len(remote_tpb) == len(self._group_kinds) will fail when connecting to a remote engine that is homogeneous (e.g., an older vLLM version or a different model configuration). In such cases, nixl_agent_meta.tokens_per_block_per_group is None, causing remote_tpb to be an empty tuple, while self._group_kinds is always populated. For homogeneous remotes, remote_tpb should default to a tuple of nixl_agent_meta.block_size repeated for each group.

Suggested change

-            remote_tpb = tuple(nixl_agent_meta.tokens_per_block_per_group or ())
-            assert len(remote_tpb) == len(self._group_kinds), (
-                f"Remote tokens_per_block_per_group length "
-                f"{len(remote_tpb)} != {len(self._group_kinds)} groups"
-            )
+            remote_tpb = tuple(nixl_agent_meta.tokens_per_block_per_group) if nixl_agent_meta.tokens_per_block_per_group is not None else (nixl_agent_meta.block_size,) * len(self._group_kinds)
+            assert len(remote_tpb) == len(self._group_kinds), (
+                f"Remote tokens_per_block_per_group length "
+                f"{len(remote_tpb)} != {len(self._group_kinds)} groups"
+            )
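
For illustration only, the fallback described above can be exercised in isolation. This is a minimal sketch, not the connector's code: FakeAgentMeta and resolve_remote_tpb are hypothetical names, and the only fields assumed are the two mentioned in the comment (block_size and tokens_per_block_per_group).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeAgentMeta:
    """Hypothetical stand-in for the remote handshake metadata."""

    block_size: int
    tokens_per_block_per_group: Optional[Sequence[int]] = None


def resolve_remote_tpb(meta: FakeAgentMeta, num_groups: int) -> tuple:
    """Return one tokens-per-block value per local KV cache group."""
    if meta.tokens_per_block_per_group is None:
        # Homogeneous remote: no per-group list was published, so every
        # group is assumed to use the single remote block_size.
        return (meta.block_size,) * num_groups
    remote_tpb = tuple(meta.tokens_per_block_per_group)
    if len(remote_tpb) != num_groups:
        raise ValueError(
            f"Remote tokens_per_block_per_group length "
            f"{len(remote_tpb)} != {num_groups} groups"
        )
    return remote_tpb
```

With a homogeneous remote, resolve_remote_tpb(FakeAgentMeta(block_size=16), 3) yields one entry per group instead of an empty tuple, so the length check no longer trips.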

gemini-code-assist · 2026-04-28T20:59:35Z

+        fa_slot = fa_slot_map.get(p_rank, 0)
+
+        handle: list[tuple[int, int, int]] = []
+        for j, (addr, local_len, dev) in enumerate(src_blocks_data):
+            if j < num_fa_descs:
+                chunk = local_len // fa_num_splits


The splitting logic in build_local_splits_from_plan incorrectly assumes that all attention groups share the same split count and slot mapping as the first group (plan.source_ranks_per_group[0] and plan.rank_to_attention_slot[0]). In HeteroTP models like Gemma4, different attention groups (e.g., SWA vs. FA) can have different numbers of KV heads, which may result in different numbers of remote source ranks (due to GQA deduplication) and different head-to-slot assignments. Using group 0's parameters for all attention descriptors will lead to incorrect memory slicing and data corruption for other groups.
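
One possible shape for a per-group version is sketched below. It assumes each attention descriptor can be tagged with its group index (the desc_group argument is hypothetical) and that the slice offset is slot * chunk, which the quoted hunk does not show. The function and argument names are illustrative, not the PR's API.

```python
def split_attention_descs(
    descs,                   # list of (addr, local_len, dev)
    desc_group,              # hypothetical: group index of each descriptor
    source_ranks_per_group,  # per group: remote ranks that supply data
    rank_to_slot_per_group,  # per group: {remote rank -> slot index}
    p_rank,
):
    """Slice every descriptor with the parameters of its own group."""
    out = []
    for (addr, local_len, dev), g in zip(descs, desc_group):
        num_splits = len(source_ranks_per_group[g])
        slot = rank_to_slot_per_group[g].get(p_rank, 0)
        chunk = local_len // num_splits
        # Assumed layout: slot k owns the k-th chunk of the descriptor.
        out.append((addr + slot * chunk, chunk, dev))
    return out
```

For example, with group 0 split across two remote ranks and group 1 across four, the same p_rank produces a different chunk size and offset for each group, which is exactly what reusing group 0's parameters would miss.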

gemini-code-assist · 2026-04-28T20:59:35Z

+            for handle_data in build_local_splits_from_plan(
+                plan,
+                self.src_blocks_data,
+                self.num_descs,
+            ):


Passing self.num_descs as the num_fa_descs boundary to build_local_splits_from_plan is incorrect if self.num_descs represents the total number of descriptors (FA + SSM). This causes the logic in transfer_plan.py to treat all descriptors as attention descriptors (since j < num_fa_descs will always be true), and has_ssm_descs will be incorrectly evaluated as False. This will break the 3-read transfer logic for Mamba state.
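
A small sketch of the failure mode, assuming attention descriptors come first and SSM descriptors follow (the ordering and the partition_descs helper are assumptions made for illustration):

```python
def partition_descs(descs, num_fa_descs):
    """Split a flat descriptor list at the attention/SSM boundary."""
    fa_descs = descs[:num_fa_descs]
    ssm_descs = descs[num_fa_descs:]
    has_ssm_descs = len(ssm_descs) > 0
    return fa_descs, ssm_descs, has_ssm_descs


# Four attention descriptors followed by two SSM descriptors.
all_descs = [("fa", i) for i in range(4)] + [("ssm", i) for i in range(2)]

# Correct boundary: only the attention descriptors are counted.
_, _, ok = partition_descs(all_descs, num_fa_descs=4)

# Total count passed as the boundary: the SSM tail is empty.
_, _, broken = partition_descs(all_descs, num_fa_descs=len(all_descs))
```

Here ok is True and broken is False: with the total count as the boundary, every descriptor lands on the attention side and the SSM path is never taken.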

gemini-code-assist · 2026-04-28T20:59:35Z

+        # Partial prefix cache hit: trim to the shorter of local/remote.
+        # After ReadSpec construction, local descriptor IDs and remote
+        # block IDs should already have matched lengths per group
+        # (gather-read pairing ensures this).  Trim from the head to
+        # keep the tail (newest blocks).
        remote_block_ids = list(remote_block_ids)
-        for i, remote_group in enumerate(remote_block_ids):
-            num_remote_blocks = len(remote_group)
-            num_local_blocks = len(local_block_ids[i])
-            if not self._is_mamba_group[i]:
-                assert num_local_blocks <= num_remote_blocks
-            # Partial prefix cache hit: just read uncomputed blocks.
-            # Skip mamba groups — their blocks represent full state (conv+ssm),
-            # not per-token data, so trimming would corrupt the transfer.
-            if num_local_blocks < num_remote_blocks and not self._is_mamba_group[i]:
-                remote_block_ids[i] = remote_group[-num_local_blocks:]
+        local_block_ids = list(local_block_ids)
+        for i in range(len(remote_block_ids)):
+            n_local = len(local_block_ids[i])
+            n_remote = len(remote_block_ids[i])
+            n = min(n_local, n_remote)
+            if n_local > n:
+                local_block_ids[i] = local_block_ids[i][-n:]
+            if n_remote > n:
+                remote_block_ids[i] = remote_block_ids[i][-n:]


The new trimming logic in _read_blocks removes the Mamba-specific check that previously prevented trimming for Mamba groups. Mamba blocks represent full state (conv + SSM) rather than incremental per-token data. Trimming these blocks based on a length mismatch between local and remote requests is likely to result in state corruption. The logic should continue to skip trimming for groups where is_ssm is true.

Suggested change

-        # Partial prefix cache hit: trim to the shorter of local/remote.
-        # After ReadSpec construction, local descriptor IDs and remote
-        # block IDs should already have matched lengths per group
-        # (gather-read pairing ensures this).  Trim from the head to
-        # keep the tail (newest blocks).
-        remote_block_ids = list(remote_block_ids)
-        for i, remote_group in enumerate(remote_block_ids):
-            num_remote_blocks = len(remote_group)
-            num_local_blocks = len(local_block_ids[i])
-            if not self._is_mamba_group[i]:
-                assert num_local_blocks <= num_remote_blocks
-            # Partial prefix cache hit: just read uncomputed blocks.
-            # Skip mamba groups — their blocks represent full state (conv+ssm),
-            # not per-token data, so trimming would corrupt the transfer.
-            if num_local_blocks < num_remote_blocks and not self._is_mamba_group[i]:
-                remote_block_ids[i] = remote_group[-num_local_blocks:]
-        local_block_ids = list(local_block_ids)
-        for i in range(len(remote_block_ids)):
-            n_local = len(local_block_ids[i])
-            n_remote = len(remote_block_ids[i])
-            n = min(n_local, n_remote)
-            if n_local > n:
-                local_block_ids[i] = local_block_ids[i][-n:]
-            if n_remote > n:
-                remote_block_ids[i] = remote_block_ids[i][-n:]
+        for i in range(len(remote_block_ids)):
+            if self._group_kinds[i].is_ssm:
+                continue
+            n_local = len(local_block_ids[i])
+            n_remote = len(remote_block_ids[i])
+            n = min(n_local, n_remote)
+            if n_local > n:
+                local_block_ids[i] = local_block_ids[i][-n:]
+            if n_remote > n:
+                remote_block_ids[i] = remote_block_ids[i][-n:]
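
As a standalone sketch of the same idea, the function below trims attention groups to a matched tail and leaves SSM groups untouched. The name trim_to_matched_tail and the plain is_ssm list of booleans are assumptions standing in for self._group_kinds[i].is_ssm. It also slices by explicit start index, because in Python seq[-0:] returns the whole sequence, so a [-n:] slice would not trim anything if n were ever 0.

```python
def trim_to_matched_tail(local_block_ids, remote_block_ids, is_ssm):
    """Keep the newest min(n_local, n_remote) blocks per attention group."""
    local = [list(group) for group in local_block_ids]
    remote = [list(group) for group in remote_block_ids]
    for i, ssm in enumerate(is_ssm):
        if ssm:
            # SSM blocks hold full state (conv + ssm), not per-token data.
            continue
        n = min(len(local[i]), len(remote[i]))
        # Explicit start index avoids the seq[-0:] pitfall when n == 0.
        local[i] = local[i][len(local[i]) - n:]
        remote[i] = remote[i][len(remote[i]) - n:]
    return local, remote
```

For instance, with one attention group holding 2 local and 4 remote blocks and one SSM group holding 1 local and 3 remote blocks, only the attention group's remote list is cut down to its last two entries.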